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Abstract: Hidden Markov models have successfully been applied as models of discrete time 
series in many fields. Often, when applied in practice, the parameters of these models have 
to be estimated. The currently predominating identification methods, such as maximum- 
likelihood estimation and especially expectation-maximization, are iterative and prone to 
have problems with local minima. A non-iterative method employing a spectral subspace-like 
approach has recently been proposed in the machine learning literature. This paper evaluates 
the performance of this algorithm, and compares it to the performance of the expectation- 
maximization algorithm, on a number of numerical examples. We find that the performance is 
mixed; it successfully identifies some systems with relatively few available observations, but fails 
completely for some systems even when a large amount of observations is available. An open 
question is how this discrepancy can be explained. We provide some indications that it could 
be related to how well-conditioned some system parameters are. 
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1. INTRODUCTION 

Hidden Markov Models (HMMs) are standard tools for 
modeling discrete time series. They have, among oth¬ 
ers, been applied to such disperse fields as speech recog¬ 
nition (e.g. Rabiner (1989), Gales and Young (2008)), 
genomic sequence analysis (e.g. Eddy (1996)) and fi¬ 
nancial stock prediction (e.g. Hassan and Nath (2005)). 
The parameters of these models are usually unknown 
and have to be estimated from data when applied in 
practice. Methods for performing this task, such as the 
Expectation-Maximization (EM) algorithm (also referred 
to as the Baum-Welch algorithm when applied specifically 
to HMMs), are usually iterative and might encounter prob¬ 
lems with local minima and slow convergence (see e.g. Hsu 
et al. (2012)). 

A well-known approach for the identification of linear 
states-space models is to employ subspace techniques (e.g. 
Van Overschee and De Moor (1996), or Ljung (1998)). 
This approach is non-iterative and avoids the concern of 
local minima that can cause problems when employing 
EM. There have been attempts to apply similar methods 
to HMMs (e.g. Vanluyten et al. (2007), Hjalmarsson and 
Ninness (1998), Anandkumar et al. (2012b), and Hsu 
et al. (2012)). One of the difficulties with adapting these 
techniques is that the HMM has some restrictions that 
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are not present for regular linear systems, namely that 
some parameters represent probabilities and thus have to 
be non-negative and sum to one. 

This paper is concerned with the so-called spectral learning 
algorithm proposed by Hsu et al. (2012). The idea of 
their method is to relate observable quantities, correlations 
in pairs and triplets in a sequence of observations, to 
the system parameters. The observable quantities are 
estimated from the available data and used to recover 
estimates of the parameters of the HMM by reversing the 
relations. This is fundamentally a method of moments. It 
has been generalized to other models and put in a tensor 
decomposition framework in subsequent work, see e.g. 
Anandkumar et al. (2012a), Anandkumar et al. (2012b), 
and Anandkumar et al. (2014). 

The contribution of this paper is an evaluation of how 
the spectral learning algorithm performs on a number 
of systems. We find that its performance is varied. It 
identifies some systems well using a small amount of 
observations, but fails to identify some systems even with 
a large number of observations. 

This paper is structured as follows: The HMM is formally 
introduced in Section 2, along with statistical moments. 
Section 3 states the identification problem for HMMs. 
Section 4 outlines the spectral learning algorithm by Hsu 
et al. (2012). Section 5 discusses how the correctness of the 
estimates is measured. Section 6 provides numerical results 
from simulations for SL and EM. Section 7 concludes the 
paper with a brief discussion of the results. 



2. PRELIMINARIES 


Vectors in this paper are assumed to be column vectors 
and we use the symbol ^ to denote an empirical estimate 
of a numerical quantity. 

2.1 Hidden Markov Models 

The HMM is a generalization of the Markov chain, which 
is a discrete stochastic model that can be represented as 
a sequence of stochastic variables, say xi,X 2 ,... These 
variables take values in some set X = {1, 2 ,..., A}, which 
is called the state-space (of dimension X G N'*'). The 
crucial assumption of a Markov chain is that the Markov 
property, 

Pr[a:fc+i = i\xi = ii,X 2 =i 2 ,...,Xk = ik] 

= PT[xk+i = i\xk = ik], ( 1 ) 

holds. 

The quantity Pr[a;fc+i = i\xk = j] is called a transition 
probability. If these quantities do not depend on absolute 
time, then it is possible to summarize all transition prob¬ 
abilities in a constant matrix T € [0,1]^^^, known as the 
transition matrix, with elements 

[T]ij = Pr[a;fc+i = i\xk = j]. (2) 

This matrix is column stochastic, i.e., it has to satisfy 
elementwise non-negativity, 

[T],, e [0,l]Vi,i = 0,l,...,A, (3) 

and the elements in each column must sum to one, 

^[T],, = lVj=0,l,...,A. (4) 

i=l 

The state of the Markov chain is not directly observed 
in an HMM, but rather an observation yk is made out 
of a finite set y = {1,2, ...,V} at each time instant k, 
where Y G N"*" is the number of possible observations. 
The observation made at time fc is a stochastic variable, 
but is related to the (hidden) state of the Markov chain. 
The observation probabilities of an HMM can be written 
as a matrix O € [0,referred to as the observation 
matrix, with elements: 

[0]ij = Pr[yk = i\xk = j]. ( 5 ) 

This matrix is also column stochastic and satisfies similar 
relations as those for T. 

These two quantities along with the initial distribution 
TTo G [0,1]^ of the Markov chain, defined as 

[TToJi = Pr[a;i = f], (6) 

completely specify the HMM. A more complete description 
of the HMM can be found in Cappe et al. (2006). 

2.2 Moments 

In the spectral learning algorithm, the joint probabilities 
of n-tuples of consecutive observations are decomposed 
into the parameters of the HMM. These are called nth 
order moments. Only the first, second and third order 
moments are utilized, which are vector, matrix and third 
order tensor quantities, respectively, and which are defined 
as follows: 

(7) 


for f = 1 , 2 ,..., y, and 

[5'2,i]ij = Pr[ 2/2 =i,yi= j], (8) 

for i,j = 1,2, ...,y. The third order moments can be 
written as Y matrices if one index is fixed: 

[S3,y,l]^J = Pr[y 3 = b 2/2 = y, 2/1 = j], (9) 

for i,j,y = 1, 2,... , Y. Here, yi refers to the first item in 
a triplet of consecutive observations, j /2 to the second and 
j /3 to the third. 

Note that it is possible to define other moments of the 
same orders, for example S' 34 , in an analogous fashion. 

These quantities can straightforwardly be estimated from 
data by sampling triplets of observations from the HMM 
and calculating the relative frequency of different single- 
tons, pairs and triplets. If a long consecutive sequence of 
observations is available, then either a sliding window or 
an independent sampling approach can be used to generate 
triplets. 

3. PROBLEM FORMULATION 

The problem we aim to solve is the following: At hand is 
an HMM-system with unknown parameters. The data that 
we have available is a sequence of consecutive observations 
from the system. To be able to construct for example an 
HMM-filter, we want to fit parameters of an HMM such 
that it models the observed data well. We assume that we 
know the number of hidden states X and the number of 
possible observations Y of the true system. 

4. THE SPECTRAL LEARNING ALGORITHM 

Hsu et al. (2012) are not mainly concerned with recovering 
explicit expressions for estimates of the parameters T, O 
and TTo, but rather intend to identify another parametriza- 
tion of the HMM. This parametrization allows them to 
calculate the probabilities of sequences of observations, 
for example, the joint distribution Pr[?/i, j/ 2 , • ■ • > 2 /fe] s-nd 
the conditional distribution Pr[yk+i\yi,y 2 , ■ ■ ■ ,yk]. They 
describe in an appendix how the method by Mossel and 
Roch (2006) can be used in conjunction with their method 
to recover explicit expressions for T, O and tto . This 
combined method will be outlined below, and is what we 
refer to as Spectral Learning (SL). 

It is possible to derive the following relations between 
different moments and the parameters of the HMM: 


II 

0 

0 

(10) 

' 2,1 = OT diag(7ro)0^. 

(11) 

' 3.1 = diag(7ro)0^ 

(12) 

OT diag(ey 0)r diag(7ro)0^, 

(13) 


where Ci is the fth canonical (column) unit vector. Details 
can be found in e.g. Hsu et al. (2012), Mattila (2015) and 
Johnson (2012). 

Hsu et al. (2012) make the assumptions that ttq > 0 
elementwise and that T and O have rank A. To be able 
to handle the case that F > A, a matrix U of the left 
singular values of 5'2,i corresponding to the A largest 
singular values is introduced. 


[Si]i = Pi[yi = i] 
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Fig. 1. Performance of SL on seven systems. All points in the plots are averaged over 20 simulations. The two leftmost 
plots show the MSE of the estimated observation (O) and transition (T) matrices. The rightmost plot shows the 
fraction of the 20 simulations that resulted in some element of T or O being negative or having a non-zero imaginary 
component. 


Algorithm 1 Spectral Learning from Hsu et al. (2012) 

1 ; Sample triplets of observations (yi,y 2 ,y 3 ) from the 
HMM and form empirical estimates Si, >S 2 ,ij Ss^i and 
S 3 ,y,i for y = ^ 

2: Calculate the SVD of ^ 2,1 and form U by taking the 
left singular vectors corresponding to the X largest 
singular values as columns. 

3: Generate Y normally distributed random variables 
{gy^N{0,l)-.y = 

4: Perform an eigendecomposition of the left hand side of 
(16) when the above estimates are used, i.e. of 

y=i 

5: Use the matrix of eigenvectors from Step 4 to diago¬ 
nalize (U'^S 3 ^y^i)(U^S 3 ^i)~^ for y = 1,... ,Y and take 
the diagonal as row y of O. The diagonalizations are 
performed by multiplicating from the left and by the 
inverse from the right in accordance with (15). 

6 ; Calculate 

TT ^ d+,Si 

and 

6+52,1(0+)^diag(7ro)-i. 

7; Return O, T and ttq. 


By employing the relations given above, it can be shown 
that the following expression holds for y = 1,2,... ,Y-. 

S 3 ,y,i){U^ S 3 ,i)+ = (U^Or)diag(e^O)(U^OT)-i, 

(14) 

where denotes the Moore-Penrose pseudo-inverse. 

What should be noted here is that everything on the left 
hand side of (14) can be estimated from data and that the 
right hand side is an eigenfactorization where the diagonal 
matrix contains the j/th row of the observation matrix. 

The idea of SL is to recover one row of O at a time using 
the trivial reformulation of (14): 

{U^OT)-\U^S 3 ,y,i){U^S 3 ,i)+{U^OT) = diag(e^O). 

(15) 


Notice that the same transformation {U'^OT) diagonalizes 
(14) for every y. This is important for preserving the order 
of the eigenvalues. 

The transformation could be recovered for any y, but to 
increase the robustness of the method, a set of random 
variables, gi ^ A/’(0,1) for z = 1,2,..., Y, is introduced. 
Here, Af {y., cr^) is the normal distribution with mean value 
y and variance cr^. It can be shown from (14) that 

Y 

Y.gy{U^S3,y,l){U^S3,l)+ = 

y^l 

Y 

(U^OT){^5,diag(e^O)}(C/^Or)-i (16) 

!/=l 

holds. 

Thus, by performing an eigendecomposition of the left 
hand side of (16), we recover the diagonalization trans¬ 
formation iU'^OT), up to scaling and permutation of the 
columns. This matrix is then utilized in (15) to recover 
each row of O. 

Remark 1. If for each y an eigendecomposition of (14) is 
performed directly using some numerical method, then the 
elements of O will be recovered, but since eigenvalues do 
not have any intrinsic ordering, as O is assembled (row 
by row), every row will have its elements sorted according 
to the order in which the eigenvalues were returned by 
the method performing the eigendecomposition. Since the 
columns of O correspond to the different hidden states of 
the HMM, a consistent ordering is required. By using the 
same diagonalization matrix — the one recovered in (16) 
— we guarantee that the eigenvalues have a consistent 
ordering. 

Once O is recovered, Hsu et al. (2012) suggest that 
the following expressions be used to recover the other 
parameters: 

^0 = 0+^1 (17) 

and 


T = 0+^2.i(0+)^diag(^o)-'. 


(18) 
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To employ the method on real data, estimates are to 
be used for the moments Si, S'2.1) S^^i and S'a.j,,!- This 
concludes the SL algorithm, which for convenience is 
summarized in Algorithm 1. 

Remark 2. Since in our problem formulation we make the 
assumption that all our observations are consecutive, un¬ 
like Hsu et al. (2012) who independently sample triplets 
of observations to estimate the different moments, our es¬ 
timate of the initial distribution will be an estimate of the 
stationary distribution of the Markov chain corresponding 
to the transition matrix T. 

5. ERROR MEASURE 

We use the Mean Squared Error (MSE) of the elements 
in the relevant matrices as a measure of the error of the 
estimates. For two matrices A,A S the MSE is 

dehned as 

MSE(A,i) = , (19) 

m X n 

where || • Ijj;’ denotes the Frobenius norm. 

It is important to notice that the columns of O, and 
the rows and columns of T, can be permuted compared 
to those in O and T, depending on the method used to 
calculate the diagonalization transformation of (16). It is 
thus not meaningful to calculate and present, for example, 
MSE(0, d) directly. 

The columns of O, and the rows and columns of T, have 
to be permuted as to line up with those of the original 
matrices used to generate the sequence of observations if 
MSE(-,') is to make sense. This is a simple combinatorial 
problem which is solved in the results to be presented. 
The intuition behind this is that the order of the hidden 
states can not be discerned from external observations of 
the HMM (nor in the case of a regular linear system). It 
follows from the SL algorithm that HMMs are generically 
identifiable up to these permutations. 

Also notice that there is nothing in SL that enforces the 
stochasticity constraints on the estimates of T, O and ttq. 
The elements of O are the result of an eigendecomposition, 
and can therefore turn out to be negative, or even complex, 
numbers. The elements of T and ttq might also lie outside 
of [0,1], since they are calculated from O. This is currently 
one of the main difficulties of using SL. 

6 . NUMERICAL RESULTS 

This section presents numerical results for SL and EM. 
All simulations were performed on a 1.3 GHz MacBook 
Air with 4 GB RAM. 

6.J Spectral Learning 

The performance of SL ^ was evaluated on seven systems 
(taken from standard texts on HMMs or conceived) and 
the results are presented in Figures I and 2. The systems 
had three hidden states, and three (Examples 2, 4, 5, 6 
and 7) or ten (Examples 3 and 8) possible observations, i.e. 

^ The implementation and simulations were performed using 
MATLAB R2013a. 




Fig. 2. Computational time, as measured from the point 
that data is given to the algorithm until estimates of 
the parameters are returned, for SL and EM on the 
different systems. Points in the plots are averaged over 
3 simulations for EM and 20 for SL. Notice the scales 
of the time-axes in the two plots. 

X = 3 and Y G {3,10}. The exact numerical parameters 
(i.e. the transition and observation matrices) for each 
example can be found in Mattila (2015). Every point in 
the plots is the average of 20 simulations. 

As previously mentioned, there is no guarantee that the 
estimates recovered by SL are (stochastically) valid. In 
the rightmost plot of Figure I, the fraction of these 20 
simulations that resulted in estimates with elements with 
negative or non-zero imaginary components is presented. 

SL demonstrated a mixed performance on the examples on 
which it was evaluated. It converged for a relatively small 
amount of samples on Examples 3 and 6, but failed to 
converge even for large amounts of samples for Examples 
4, 5 and 8. This can be seen both from the error in the 
estimates of T and O, but also from the fraction of invalid 
estimates. The performance on Examples 2 and 7 was in 
between these other two performance classes. 

It is worth noting that the slopes of the errors appear to 
be constant once SL starts providing a large fraction of 
valid estimates, which happened for different amounts of 
samples for the different systems. 

6.2 Expectation-Maximization Method 

For a thorough treatment of the EM-algorithm, see 
McLachlan and Krishnan (2008). The implementation pro¬ 
vided in MATLAB (hinintrain) was used with the maxi¬ 
mum number of iterations taken as the default value of 
500. Three of the examples on which SL had varying 
performance were considered. In Figures 2 and 3, the EM- 
algorithm was started with random matrices as initial 
guesses for the HMM parameters. 

EM performed better, errorwise, than SL on Examples 2 
and 4 when a small amount of observations was available. 
However, as is apparent in Figure 2, the time-scale is orders 
of magnitude larger than that of SL. For all of the examples 
that were considered, when about 10® available samples 
were available, SL delivered estimates in less than a tenth 
of a second, whereas EM required about ten minutes. 

The initial guesses for the parameters in EM play an 
important role as can be seen in Figure 3; the error of the 
estimated T matrix increased for two of the systems as 
more and more of samples were available. This is probably 


























































Fig. 3. MSE of the estimates of the observation (O) and 
transition (T) matrices for the EM-algorithm on three 
different systems when random matrices were used as 
initial guesses. 

due to the random initial guesses being worse in those 
simulations. 

In Figure 4, the true parameters of the HMMs were used as 
initial guesses in the EM algorithm. To make a comparison 
between EM and SL easier, we also plot the performance 
of SL (from Figure 1) as dashed lines in the same figure. 

The error for EM when started at the true system param¬ 
eters is closely related to the Cramer-Rao bounds for how 
well any estimation method can perform. SL was close to 
the performance of EM for Example 3, but was far off for 
the other two examples. Note that the EM curves lie higher 
for those two examples than for Example 3, implying that 
the maximum of the likelihood function for the available 
data lies further away from the true maximum. 

6.3 Relation to Condition Numbers 


Table 1. The examples from Figure 1 sorted according 
to the error in the estimate of the O-matrix when 10^ 
samples were used, together with the condition number of 
the product OT. 


Ex. 

Error of O-estimate with 10^ samples 

cond(OT) 

4 

2.75x10“^ 

115.4 

5 

4.42x10-2 

27.1 

8 

1.72x10-2 

208.3 

7 

6.06x10-^ 

21.6 

2 

4.55x10-® 

10.8 

6 

4.66x10-® 

5.4 

3 

3.03x10-^ 

2.6 


Since the matrix U in SL is not absolutely specified (the 
choice as some of the singular vectors is just a suggestion, 
see Hsu et al. (2012) for details), we provide in Table 1 
the condition number of OT for the different systems in 
Figure 1. We have sorted the examples after what the MSE 
was when 10^ samples were available in the estimation 
procedure. The condition numbers are perfectly sorted 
except for the outlier Example 8. This suggests that the 
condition number of OT could give some indication on how 
well SL will perform for a given system. 

7. CONCLUSION 

We have implemented and evaluated the algorithm out¬ 
lined in Hsu et al. (2012), which uses spectral methods to 
deliver one-shot estimates of the parameters of an HMM. 
We saw that the performance of the algorithm is overall 
good, but that it fails completely for some systems, and 
that the performance could be related to a certain condi¬ 
tion number. 


The mixed performance of SL, compared both to EM and 
to itself on the different examples, could be due to many 
reasons. One could be that there is a loss of information 
when only the first, second and third order moments are 
used, instead of moments up to the length of the whole 
observation sequence. 

Regarding the discrepancy between the different examples, 
a simple perturbation analysis (Mattila (2015)) shows that 
the condition number of (UOT) is related to the upper 
bound on how far the eigenvalues might move in (14) 
(i.e. the elements of one row of the recovered observation 
matrix) due to perturbations in the estimates of the 
moments. 



Fig. 4. MSE of the estimates of the observation (O) and 
transition (T) matrices for the EM-algorithm on three 
different systems when the true matrices were used as 
initial guesses (solid), along with the corresponding 
results of SL from Figure 1 (dashed). 


We also observed that SL is orders of magnitude faster 
than EM, especially when a large amount of observations 
is available. This suggests that SL could be a preferable 
alternative to EM in cases where there is a large number 
of observations available from the HMM. The recovered 
estimates can then be refined, if necessary, using for 
example EM. Note that this assumes that the estimates 
recovered from SL are stochastically valid, which was not 
the case for many systems. SL started providing valid 
estimates when different amounts of samples were available 
for the different systems. 
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